The following are the steps required to perform the necessary analysis on the textual corpus of accident reports. Each of these steps contains multiple tasks, which are discussed in detail under their respective sections.
The datasets are obtained from the "osha.gov" website. Each accident report contains a Summary along with a Report ID, Event Date, and Degree. The last column, Degree, indicates whether the accident resulted in a fatality or hospitalization of the victim. This feature can be useful (as a dependent variable) for text classification.
There are two data files for different NAICS codes. We will import them both into pandas DataFrames and later combine their rows, sorted by date.
import pandas as pd
#read in the data files
ncs_21311_df = pd.read_csv('data/naics_21311_data.csv', sep = ',', header=0, parse_dates = ['Event_Date'], na_values="")
ncs_237120_df = pd.read_csv('data/naics_237120_data.csv', sep = ',', header=0, parse_dates = ['Event_Date'], na_values="")
dfs = [ncs_21311_df,ncs_237120_df]
raw_data = pd.concat(dfs, ignore_index=True) #combine the two dataframes
raw_data.sort_values(by='Event_Date', inplace = True) #sort by event date
raw_data.reset_index(drop = True, inplace = True) #reset index from 0
Let's have a look at the data and its columns to observe any content of interest.
raw_data.tail()
Missing Values
raw_data.isnull().sum()
About 70% of the Age values are missing, and 10 rows are missing the Degree value. We will drop the Age column later for analysis and delete the rows with a missing Degree.
raw_data = raw_data[pd.notnull(raw_data['Degree'])]
Creating a Time field: The dataset has 1151 rows and 7 columns. Each summary report begins with a phrase that contains the time when the accident happened. Not all reports contain a time, but most do. It can be useful for time-based analytics/prediction in the future. Keeping the time in a separate column is more convenient for testing correlation than leaving it inside the report summary. So, we will split the report text to create a new Time column in the dataframe.
df = raw_data #add data to a temporary dataframe
""" All rows contain date-year, so split based on year in string.
For str.split() method specify the pattern as 1st argument,
We only need one split to happen - which is the 2nd argument,
expand=True returns a dataframe with 2 columns"""
tmp = df.Summary.str.split(r"20\d+", n=1, expand=True)
raw_data['Time'] = tmp[0]
raw_data = raw_data[['Report_ID', 'Event_Date', 'Age', 'Time', 'Nature', 'Establishment', 'Summary', 'Degree']]
raw_data.tail()
Clean Time field: The Time column consists of text strings that need to be transformed to contain only the time information in an appropriate format. So, the strings need to be sliced and converted to a datetime format. Then the time accuracy is reduced to the hour of day.
# Extract the time content from each string and modify it into a format suitable for conversion into a time field
raw_data.Time = raw_data.Time.str.extract(r'(\d+:\d+ \w.\w.)', expand=False) #select time based on regex
raw_data.Time = raw_data.Time.str.upper() #convert to uppercase
raw_data.Time = raw_data.Time.str.replace(".", "", regex=False) #remove periods
raw_data.Time = pd.to_datetime(raw_data.Time, errors='coerce', format='%I:%M %p') #convert to time
from datetime import datetime, timedelta
def hour_rounder(t):
    # Rounds to the nearest hour by adding an hour when minute >= 30
    return (t.replace(second=0, microsecond=0, minute=0, hour=t.hour)
            + timedelta(hours=t.minute // 30))

raw_data.Time = raw_data.Time.apply(lambda i: hour_rounder(i) if str(i) != "NaT" else i)
raw_data.Time = raw_data.Time.dt.hour
raw_data.Time = raw_data.Time.astype('category')
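As a quick sanity check on the rounding logic, `hour_rounder` can be exercised on a couple of hand-picked timestamps (a self-contained sketch; the function is repeated here so the snippet runs on its own):

```python
from datetime import datetime, timedelta

def hour_rounder(t):
    # Round to the nearest hour: bump by one hour when minute >= 30
    return (t.replace(second=0, microsecond=0, minute=0, hour=t.hour)
            + timedelta(hours=t.minute // 30))

print(hour_rounder(datetime(2017, 5, 3, 14, 40)))  # rounds up to 15:00
print(hour_rounder(datetime(2017, 5, 3, 14, 20)))  # rounds down to 14:00
```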
With the missing values handled, we now make a copy of the Summary column on which to perform text preprocessing.
reports = raw_data['Summary']
reports.head()
# export as csv files
raw_data.to_csv('raw_data.csv', sep=',', encoding='utf-8', index = False)
Preprocessing is an important yet time-consuming part of analyzing textual data. It is vital because we need to convert human-readable text into a machine-readable format, which involves several steps and multiple tools or packages. Rather than importing all packages at the beginning, I will import each of them in the cell where it is first used.
We will start with Text Normalization, which includes:
- converting text to lowercase
- removing numbers
- removing punctuation
- removing extra whitespace

Depending on our needs and the nature of the data, some of these steps might be redundant. We can remove or add steps as we see fit, so as not to lose any important details in the data.
reports = reports.str.lower()
reports[3] #peek at a report
Generally, numbers are not relevant for text analysis, as it is harder to extract context or meaning from numbers than from words. In our reports, we can observe that the numbers indicate time and date, which are already provided in separate fields if we need them. The last line in the report conveys the vital information that "three employees are killed" in word form, so we are not discarding any useful data.
reports = reports.str.replace(r"\d+", "", regex=True) #remove digits
reports[3] #peek at a report
We can use the re and string libraries to select and remove any symbols like [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]:
import re
import string # Along with importing string, we are also using the 're' Regex library
reports = reports.str.replace('[%s]' % re.escape(string.punctuation), '', regex=True)
reports[3] #peek at a report
We will use the pandas str.strip() method to remove any leading or trailing whitespace, and a regex replacement to collapse any duplicate whitespace between words.
reports = reports.str.strip()
reports = reports.str.replace(' +', ' ', regex=True) #collapse duplicate spaces
reports[3]
Tokenization is the process of splitting the given text into smaller pieces called tokens. Stop words (commonly occurring words) should be removed from the text data, as they do not provide any value and also make our word matrix sparser later. For this purpose, we can either create a list of stop words ourselves or use predefined libraries. Here we'll use the nltk library, which contains all basic stop words in the English language.
Apart from these words we can also remove some specific words by creating a word set.
import nltk
"""
If python gives out error for not finding some subpackages,
running these code lines may help:
nltk.download('stopwords')
nltk.download('punkt')
"""
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# tokenize
word_tokens = reports.apply(lambda r: word_tokenize(r))
word_tokens.head()
Apart from the common language stopwords explained above, there are other kinds of words not useful for the analysis, for example names of months, or common words that are heavily used but not excluded by the stopwords library. We will first remove known stopwords and common words and then check the data for the most frequently occurring words.
# For now we will remove basic stopwords and month names from the tokens
rmv_words = {'january', 'february', 'march', 'april','may','june','july','august','september','october','november','december', 'am', 'pm', 'however', 'first', 'one', 'two', 'three', 'five'}
stop_words = set(stopwords.words('english'))
stop_words = stop_words.union(rmv_words)
word_tokens = word_tokens.apply(lambda t: [i for i in t if not i in stop_words])
word_tokens.head()
Now let's check the data for the ten most frequent words and decide whether to remove or retain them.
freq = word_tokens.values.flatten().tolist()
freq = pd.Series(item for sublist in freq for item in sublist)
print(freq.value_counts()[:10])
print(len(freq.value_counts()))
The above words are all related to the accidents, and we cannot immediately discard any of them. We need to observe how they relate to one another, and if some of them are not adding value we can remove them then.
Having seen the common words, it is just as important to check the rare words. Each word becomes a column in the sparse matrix, and a rare word's column contains mostly null values, adding complexity when developing a model for prediction or classification. Ignoring terms with a document frequency lower than a given threshold can help generalization and prevent overfitting: we want to limit ourselves to the set of terms that show strong associations. So, to remove the sparse terms we need to set a threshold, and to choose a sensible threshold we need to look at the distribution of the word frequencies.
import matplotlib.pyplot as plt
dst = freq.value_counts()[1:] # exclude first value, it impairs detail in plot
dst.plot.bar(figsize=(20,5), title="Frequency Distribution Plot")
ax1 = plt.axes()
x_axis = ax1.axes.get_xaxis()
x_axis.set_visible(False)
plt.show()
plt.close()
The distribution is extremely skewed to the right and logarithmic in nature. There are a total of 2672 terms/words, and most of them have a frequency of less than 25. Because most values are low frequencies, the distribution is flipped and the high-frequency terms appear as outliers. We can view these with a boxplot, which can help us pick a threshold to eliminate rare words.
dst.plot.box(figsize=(5,10), title="Box Plot")
plt.show()
plt.close()
As expected, rare words dominate the distribution, which has an upper limit around 10. So, we will remove all rare words which appear fewer than 10 times in the entire corpus.
rare_words = freq.value_counts()[freq.value_counts() < 10]
rare_words = list(rare_words.index)
len(rare_words)
The sparse matrix to be created is reduced to close to 10% of its initial size. This makes it convenient to build a time- and space-efficient model that can generalize well. If necessary, we can always trim the model down later based on its performance.
# Remove words from word tokens
word_tokens = word_tokens.apply(lambda t: [i for i in t if not i in rare_words])
Part-of-speech tagging aims to assign parts of speech to each word of a given text (such as nouns, verbs, adjectives, and others) based on its definition and its context.
# We will use the nltk package method to create tags
pos_tags = word_tokens.apply(lambda t: nltk.pos_tag(t))
print(pos_tags.head(),'\n\n', pos_tags[0])
We obtained POS tags for all of our words, but they are in what is called the Penn Treebank format. For nltk's lemmatizer to work, we need the tags in WordNet format. Therefore I created a small mapper function below to convert Penn tags to WordNet tags.
# part is a dictionary containing the Penn to WordNet mapping as key, value pairs
part = {
    'N': 'n',
    'V': 'v',
    'J': 'a',
    'S': 's',
    'R': 'r'
}

def convert_tag(penn_tag):
    """
    convert_tag() accepts the **first letter** of a Penn part-of-speech tag,
    then uses a dict lookup to convert it to the appropriate WordNet tag.
    """
    if penn_tag in part.keys():
        return part[penn_tag]
    else:
        # other parts of speech will be tagged as nouns
        return 'n'
wordnet_tags = pos_tags.apply(lambda p: [(word, convert_tag(tag[0])) for word, tag in p])
print(wordnet_tags.head(), '\n\n', wordnet_tags[0])
The POS tags are now converted into Wordnet format and we can input these now to the lemmatizer function.
The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form. As opposed to stemming, lemmatization does not simply chop off inflections. Instead it uses lexical knowledge bases to get the correct base forms of words.
# download if 'wordnet' not in nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer=WordNetLemmatizer()
lmt_word_tkns = wordnet_tags.apply(lambda w: [lemmatizer.lemmatize(word[0], word[1][0]) for word in w])
print(lmt_word_tkns.head())
# Compare original words and lemmatized words
cmp_df = pd.DataFrame({'Original':word_tokens[1], 'Lemmatized':lmt_word_tkns[1]})
cmp_df
Finally, the words are reduced sufficiently to their base forms. If we observe the words, 'employee' turns out to be an important word, because each time it is used, the words following it tell a story. However, the above table is a token-list comparison for one document; the words across all documents need to be checked for duplicates and overwritten with base words manually.
freq = lmt_word_tkns.values.flatten().tolist()
freq = pd.Series(item for sublist in freq for item in sublist)
all_terms = freq.value_counts()
all_terms = all_terms.index
print(sorted(list(all_terms)))
len(all_terms)
There are several words that are extensions of their base form; for example, the three words 'amputate', 'amputation', and 'amputated' all extend from 'amputate'.
This problem can be handled in three steps: create a dictionary mapping each base word to its variants, join the token lists back into strings, then replace each variant with its base word and re-split into tokens.
# Create a python dictionary
rpl_str = { 'amputate' : ['amputated','amputation'],
'attach' : ['attached'],
'attempt' : ['attempted'],
'bag' : ['bags'],
'blow' : ['blew'],
'bolt' : ['bolts'],
'break' : ['broken'],
'catch' : ['caught'],
'cause' : ['causing'],
'collapse' : ['collapsed'],
'connect' : ['connection','connected'],
'coworker' : ['coworkers'],
'crush': ['crushing','crushed'],
'death' : ['dead','die','died'],
'drill' : ['driller','drilling'],
'employee' : ['employ'],
'excavate' : ['excavation','excavator'],
'explosion' : ['explode'],
'fall' : ['fell'],
'find' : ['found'],
'fracture' : ['fractured'],
'heat' : ['heater'],
'hospital' : ['hospitalize','hospitalized'],
'ignite' : ['ignited'],
'inch' : ['inches'],
'kill' : ['killing','killed'],
'leg' : ['legs'],
'lift' : ['lifted'],
'load' : ['loaded'],
'location' : ['locate'],
'low' : ['lower'],
'occur' : ['occurred'],
'operate' : ['operation','operator','operating','operated'],
'pipe' : ['pip','piping','pipeline','piped'],
'place' : ['placed','placing'],
'pressure' : ['pressurize','pressurized'],
'provide' : ['provided'],
'pull' : ['pulled'],
'receive' : ['received'],
'suffer' : ['suffered'],
'sustain' : ['sustained'],
'strike' : ['striking'],
'tank' : ['tanker','tanks'],
'throw' : ['thrown'],
'tube' : ['tubing'],
'transport' : ['transported','transporting'],
'treat' : ['treating','treated','treatment'],
'use' : ['used','using'],
'weld' : ['welding','welded']
}
new_word_tokens = lmt_word_tkns
new_word_tokens = new_word_tokens.apply(lambda t: ' '.join(t))
new_word_tokens = new_word_tokens.apply(lambda t: ' '+t+' ')
for k, v in rpl_str.items():
    for variant in v:
        # use word boundaries so only whole tokens are replaced
        new_word_tokens = new_word_tokens.str.replace(r'\b'+re.escape(variant)+r'\b', k, regex=True)
new_word_tokens = new_word_tokens.apply(lambda t: t.strip())
new_word_tokens = new_word_tokens.apply(lambda t: t.split(' '))
freq = new_word_tokens.values.flatten().tolist()
freq = pd.Series(item for sublist in freq for item in sublist)
all_terms = freq.value_counts()
all_terms = all_terms.index
print(sorted(list(all_terms)))
len(all_terms)
The unique set of 310 words is now reduced to 247 words by the mapping, removing roughly 20% of the redundancy in the sparse matrix without loss of value. The next stage is to structure the word tokens for analysis and extract interesting features to detect patterns in the data.
This section deals with text structuring and advanced processing of the accident reports data. Until now the text is in an unstructured format, and to perform any analytics/modeling it must be converted to a numerical format. Word embeddings are texts converted into numbers, and there may be different numerical representations of the same text. A word embedding format generally maps a word, using a dictionary, to a vector.
Word embeddings are either frequency-based or prediction-based. Prominent techniques from both categories will be used to structure the data, which will then be evaluated to perform the appropriate analytics.
This is the most obvious technique for finding the relevance of a word in a document: the more frequent a word is, the more relevance it holds in context. It is the ratio of the number of times the word appears in a document to the total number of words in that document, so term frequency increases with the number of occurrences of the word within the document.
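As a minimal illustration of the ratio described above, term frequency can be computed by hand on a made-up token list (toy data, not the actual report corpus):

```python
from collections import Counter

# hypothetical token list standing in for one preprocessed report
doc = "employee fall ladder employee hospital".split()
counts = Counter(doc)

# term frequency = occurrences of the word / total words in the document
tf = {word: n / len(doc) for word, n in counts.items()}
print(tf["employee"])  # 2 occurrences out of 5 tokens -> 0.4
```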
# Most Frequent Terms
freq = new_word_tokens.values.flatten().tolist()
freq = pd.Series(item for sublist in freq for item in sublist)
mft = freq.value_counts()[:30]
mft = ' '.join(mft.index)
We have extracted the top 30 most frequent words; now let's plot a word cloud to view them. This gives us a peek at the theme of the words across all records.
# Word Cloud of Top 30 Most Frequent Terms
from wordcloud import WordCloud
wordcloud = WordCloud(width=600, height=400, min_font_size=10, max_font_size=60, background_color='white').generate(mft)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
plt.close()
Now, we will move forward to generate a Term-Frequency sparse matrix using the scikit-learn library.
from sklearn.feature_extraction.text import TfidfVectorizer
cnvrt_to_str = new_word_tokens.apply(lambda t: ' '.join(t))
tfidf = TfidfVectorizer(use_idf=False, analyzer = 'word')
tf_matrix = tfidf.fit_transform(cnvrt_to_str)
tf_matrix
Now that we have created a TF matrix containing 415 reports and 317 word tokens, it is hard to view as a table. So, let's plot a heatmap of word tokens vs. documents using the seaborn library.
import seaborn as sns
fig,ax = plt.subplots(figsize=(20,15))
sns.heatmap(tf_matrix.todense(), ax=ax)
plt.title("Heatmap of Term-Frequency sparse matrix")
plt.xlabel("Word Tokens")
plt.ylabel("Accident Reports")
plt.show()
plt.close()
It is easy to notice how sparse the matrix is, with one bright streak for the word 'employee'. There are also patterns where the usage of some words changes over time across reports. Note that the reports are sorted by event date, from 2013 to 2017. But we cannot discard the sparse terms without finding out whether they contribute to any associations between documents, because some words may be used frequently in a few reports to describe a specific type of accident but not in others. To attain that knowledge we need to generate a TF-IDF matrix.
In information retrieval, TF-IDF, short for term frequency-inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The tf-idf value increases proportionally with the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words appear more frequently in general. Tf-idf is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use tf-idf.
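The definition can be made concrete with a hand computation on a toy corpus. Note this sketch uses the classic unsmoothed idf = log(N/df); scikit-learn's TfidfVectorizer applies a smoothed variant with L2 normalization, so its numbers will differ:

```python
import math

# three hypothetical token lists standing in for preprocessed reports
corpus = [
    "employee fall ladder".split(),
    "employee burn chemical".split(),
    "employee fall scaffold".split(),
]
N = len(corpus)

def tf_idf(word, doc):
    tf = doc.count(word) / len(doc)            # term frequency within the document
    df = sum(1 for d in corpus if word in d)   # number of documents containing the word
    return tf * math.log(N / df)               # classic (unsmoothed) idf

print(tf_idf("employee", corpus[0]))  # appears in every document -> idf = log(1) = 0
print(tf_idf("ladder", corpus[0]))    # rare word -> positive weight
```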
# Generate a tf-idf matrix
tfidf = TfidfVectorizer(use_idf=True, analyzer = 'word')
tfidf_matrix = tfidf.fit_transform(cnvrt_to_str)
tfidf_matrix
Similar to the term-frequency matrix, let's also plot a heatmap to find the differences between the TF and TF-IDF matrices.
import seaborn as sns
fig,ax = plt.subplots(figsize=(20,15))
sns.heatmap(tfidf_matrix.todense(), ax=ax)
plt.title("Heatmap of Term Frequency - Inverse Document Frequency sparse matrix")
plt.xlabel("Word Tokens")
plt.ylabel("Accident Reports")
plt.show()
plt.close()
A quick difference to note is that the weight for the word 'employee' is penalized, as it occurs commonly across all documents. Similarly, a few other common words are penalized while several words' weights are increased. Overall there are subtle improvements in the sparsity of the TF-IDF matrix. The next step is to identify the top-weighted words in the TF-IDF matrix and compare them to the most frequent words to observe any differences.
# Create pandas DataFrame from scipy csr matrix format
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns = tfidf.get_feature_names_out())
tfidf_df.head()
# rows are accident reports
# columns are word tokens
col_agg = tfidf_df.agg("sum")
top_idf = col_agg.sort_values(ascending = False)[:30]
print(top_idf)
# Word Cloud of Top 30 TF-IDF terms
from wordcloud import WordCloud
top_idf = ' '.join(top_idf.index)
wordcloud = WordCloud(width=600, height=400, min_font_size=10, max_font_size=60, background_color='white').generate(top_idf)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
plt.close()
# Compare top 30 terms between TF and TF-IDF matrices
pd.DataFrame({'TF' : mft.split(' '),
'TF-IDF' : top_idf.split(' ')})
Although mostly similar, several terms are promoted to a higher rank in TF-IDF compared to TF. Miscellaneous words like 'approximately' and 'back' that appear under TF do not exist under TF-IDF. Words like 'death', 'sustain' and 'injury' moved up several ranks in TF-IDF. The new word 'crush' is in the TF-IDF list but not in TF.
Bag of Words (BoW) is a representation of text that describes the presence of words within the text data. The intuition is that two similar text fields will contain similar kinds of words, and will therefore have similar bags of words.
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(ngram_range=(1,1),analyzer = "word")
train_bow = bow.fit_transform(cnvrt_to_str)
bow_df = pd.DataFrame(train_bow.toarray(), columns = bow.get_feature_names_out())
bow_df.head()
Bag of Words is a preliminary technique, yet understanding it is essential for Continuous Bag of Words (CBOW), a more advanced concept that extends it. The methods discussed below are prediction-based, in the sense that they assign probabilities to words, and they have proved to be state of the art for tasks like word analogies and word similarities. Another algorithm similar in topology to CBOW is the Skip-Gram model; together these two models form the algorithm known as Word2Vec. Both are shallow neural networks which map word(s) to a target variable that is also a word(s), and both learn weights which act as word vector representations.
Word2Vec moves beyond structuring tokens as word embeddings to building neural networks that can be modeled to predict multiple context-based situations.
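To make the Skip-Gram side of Word2Vec concrete, here is a sketch of how (target, context) training pairs are generated from a token sequence with a context window of 1. Only the pair generation is shown; training the actual shallow network would typically be delegated to a library such as gensim:

```python
def skipgram_pairs(tokens, window=1):
    # Pair each target word with every word within `window` positions of it
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# hypothetical token list standing in for one preprocessed report
print(skipgram_pairs(["employee", "fall", "ladder"]))
# each (target, context) pair becomes one training example for the network
```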
This section deals with mining the text data, using techniques to identify patterns and trends in it with respect to the other feature variables, including the target variable. Text mining acts as a preparatory stage for modeling text for classification. It is exploratory in nature, and hence different techniques are used to identify intelligent information among the variables in the dataset. Initially, we will explore each variable to learn its range, distribution, and number of levels, and if necessary transform it so that it can be applied to and fit the model well.
raw_data.Event_Date.describe()
# One time use - Initiate plotly
import plotly
plotly.tools.set_credentials_file(username='binti', api_key='dv1XHP8ygm00itAg9m4C')
import plotly.offline as off
import plotly.graph_objs as go
off.init_notebook_mode(connected=False)
data = [dict(
x = raw_data['Event_Date'],
autobinx = False,
autobiny = True,
#marker = dict(color = 'rgb(68, 68, 68)'),
name = 'date',
type = 'histogram',
xbins = dict(
end = '2017-12-31',
size = 'M1',
start = '2001-01-01'
)
)]
layout = dict(
#paper_bgcolor = 'rgb(240, 240, 240)',
#plot_bgcolor = 'rgb(240, 240, 240)',
title = '<b>Oil & Gas Industry Accidents</b>',
xaxis = dict(
title = 'Years',
tickangle=-90,
type = 'date',
nticks = 25
),
yaxis = dict(
title = 'Accident Count',
type = 'linear'
),
updatemenus = [dict(
x = 0.1,
y = 1.15,
xref = 'paper',
yref = 'paper',
yanchor = 'top',
active = 1,
showactive = True,
buttons = [
dict(
args = ['xbins.size', 'D1'],
label = 'Day',
method = 'restyle',
), dict(
args = ['xbins.size', 'M1'],
label = 'Month',
method = 'restyle',
), dict(
args = ['xbins.size', 'M3'],
label = 'Quarter',
method = 'restyle',
), dict(
args = ['xbins.size', 'M6'],
label = 'Half Year',
method = 'restyle',
), dict(
args = ['xbins.size', 'M12'],
label = 'Year',
method = 'restyle',
)]
)]
)
off.iplot({'data': data,'layout': layout}, validate=False)
As the Event_Date field is neither numerical nor categorical, we cannot use it directly for our analysis. And date itself is a very granular field, which makes analysis more difficult. Hence, new columns that are relevant to the data must be derived from the date.
def add_datepart(df, fldname):
    fld = df[fldname]
    targ_pre = re.sub('[Dd]ate$', '', fldname)
    for n in ('Year','Month','Week','Day','Dayofweek','Is_month_end','Is_month_start','Is_quarter_end','Is_quarter_start','Is_year_end','Is_year_start'):
        df[targ_pre+n] = getattr(fld.dt, n.lower())
    df[targ_pre+'Elapsed'] = (fld - fld.min()).dt.days
add_datepart(raw_data, 'Event_Date')
Day of Week vs Degree
dydg_df = pd.DataFrame(raw_data.Event_Date.dt.day_name())
dydg_df['Degree'] = raw_data.Degree
dydg_df = pd.DataFrame(dydg_df.groupby(['Event_Date','Degree']).size().reset_index(name='count'))
dydg_df = dydg_df.pivot(index='Event_Date', columns='Degree', values='count')
dydg_df.reset_index(inplace=True)
cats = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
from pandas.api.types import CategoricalDtype
cat_type = CategoricalDtype(categories=cats, ordered=True)
dydg_df.Event_Date = dydg_df.Event_Date.astype(cat_type)
dydg_df.sort_values('Event_Date', inplace=True)
trace1 = go.Bar(
x=dydg_df.Event_Date,
y=dydg_df.Fatality,
name='Fatality'
)
trace2 = go.Bar(
x=dydg_df.Event_Date,
y=dydg_df['Hospitalized injury'],
name='Hospitalized Injury'
)
trace3 = go.Bar(
x=dydg_df.Event_Date,
y=dydg_df['Non Hospitalized injury'],
name='Non Hospitalized Injury'
)
data = [trace1, trace2, trace3]
layout = go.Layout(
barmode='group',
title = '<b>Day of Week vs Degree</b>',
yaxis = dict(
title = 'Accident Count (2013-2017)',
type = 'linear'
),
xaxis = dict(
title = 'Days of Week'),
legend=dict(x=0, y=1.1, orientation='h')
)
fig = go.Figure(data=data, layout=layout)
off.iplot(fig, validate=False)
Month vs Degree
mdg = pd.DataFrame(raw_data.Event_Date.dt.month_name())
mdg['Degree'] = raw_data.Degree
mdg = pd.DataFrame(mdg.groupby(['Event_Date','Degree']).size().reset_index(name='count'))
mdg = mdg.pivot(index='Event_Date', columns='Degree', values='count')
mdg.reset_index(inplace=True)
cats = ['January', 'February', 'March', 'April','May','June', 'July', 'August','September', 'October', 'November', 'December']
from pandas.api.types import CategoricalDtype
cat_type = CategoricalDtype(categories=cats, ordered=True)
mdg.Event_Date = mdg.Event_Date.astype(cat_type)
mdg.sort_values('Event_Date', inplace=True)
trace1 = go.Bar(
x=mdg.Event_Date,
y=mdg.Fatality,
name='Fatality'
)
trace2 = go.Bar(
x=mdg.Event_Date,
y=mdg['Hospitalized injury'],
name='Hospitalized Injury'
)
trace3 = go.Bar(
x=mdg.Event_Date,
y=mdg['Non Hospitalized injury'],
name='Non Hospitalized Injury'
)
data = [trace1, trace2, trace3]
layout = go.Layout(
barmode='group',
title = '<b>Month vs Degree</b>',
xaxis=dict(
tickangle=-45,
title = 'Months'),
yaxis = dict(
title = 'Accident Count (2013-2017)',
type = 'linear'
),
legend=dict(x=0, y=1.1, orientation='h')
)
fig = go.Figure(data=data, layout=layout)
off.iplot(fig, validate=False)
The Time field was modified to contain only hourly information, which amounts to 24 levels per day. It may still need to be reduced to 4 levels for better performance of the model.
Time vs Degree:
hdg = pd.DataFrame({'Time':raw_data.Time, 'Degree':raw_data.Degree})
hdg = pd.DataFrame(hdg.groupby(['Time','Degree']).size().reset_index(name='count'))
hdg = hdg.pivot(index='Time', columns='Degree', values='count')
hdg.reset_index(inplace=True)
trace1 = go.Bar(
x=hdg.Time,
y=hdg.Fatality,
name='Fatality'
)
trace2 = go.Bar(
x=hdg.Time,
y=hdg['Hospitalized injury'],
name='Hospitalized Injury'
)
trace3 = go.Bar(
x=hdg.Time,
y=hdg['Non Hospitalized injury'],
name='Non Hospitalized Injury'
)
data = [trace1, trace2, trace3]
layout = go.Layout(
barmode='group',
title = '<b>Hour of Day vs Degree</b>',
xaxis = dict(
title = 'Hours of Day (24 hour format)',
nticks = 24
),
yaxis = dict(
title = 'Accident Count',
type = 'linear'
),
legend=dict(x=0, y=1.1, orientation='h')
)
fig = go.Figure(data=data, layout=layout)
off.iplot(fig, validate=False)
It is clearly obvious that most accidents occur during the 8am to 5pm period. There is a broader distinction between degree levels in the first half of the day. Instead of having 24 levels, it helps to reduce them to four, one for every 6 hours. We will create a new variable Hour_Range with levels 1am-6am, 7am-12pm, 1pm-6pm, and 7pm-12am.
raw_data['Hour_Range'] = raw_data.Time.cat.codes
raw_data['Hour_Range'] = raw_data.Hour_Range.astype('int32', errors='ignore')

def cnvrt_hour(val):
    if val >= 1 and val <= 6:
        return "1am-6am"
    elif val >= 7 and val <= 12:
        return "7am-12pm"
    elif val >= 13 and val <= 18:
        return "1pm-6pm"
    elif val == 0 or (val >= 19 and val <= 23):
        return "7pm-12am"
    else:
        return None

raw_data.Hour_Range = raw_data.Hour_Range.apply(lambda i: cnvrt_hour(i))
cats = ["1am-6am", "7am-12pm", "1pm-6pm", "7pm-12am"]
cat_type = CategoricalDtype(categories=cats, ordered=True)
raw_data.Hour_Range = raw_data.Hour_Range.astype(cat_type)
hrdg = pd.DataFrame({'Hour_Range':raw_data.Hour_Range, 'Degree':raw_data.Degree})
hrdg = pd.DataFrame(hrdg.groupby(['Hour_Range','Degree']).size().reset_index(name='count'))
hrdg = hrdg.pivot(index='Hour_Range', columns='Degree', values='count')
hrdg.reset_index(inplace=True)
hrdg.Hour_Range = hrdg.Hour_Range.astype(cat_type)
hrdg.sort_values('Hour_Range', inplace=True)
trace1 = go.Bar(
x=hrdg.Hour_Range,
y=hrdg.Fatality,
name='Fatality'
)
trace2 = go.Bar(
x=hrdg.Hour_Range,
y=hrdg['Hospitalized injury'],
name='Hospitalized Injury'
)
trace3 = go.Bar(
x=hrdg.Hour_Range,
y=hrdg['Non Hospitalized injury'],
name='Non Hospitalized Injury'
)
data = [trace1, trace2, trace3]
layout = go.Layout(
barmode='group',
title = '<b>Hour of Day vs Degree</b>',
yaxis = dict(
title = 'Accident Count',
type = 'linear'
),
legend=dict(x=0, y=1.1, orientation='h')
)
fig = go.Figure(data=data, layout=layout)
off.iplot(fig, validate=False)
import seaborn as sns
g = sns.catplot(x='Nature', y='Age', hue='Degree', kind='swarm', data=raw_data, height=5, aspect=2)
g.set_axis_labels("Nature of Accidents","Age")
g.set_xticklabels(rotation=90)
raw_data.Establishment = raw_data.Establishment.astype('category')
estdf = pd.DataFrame(raw_data.Establishment)
estdf['Degree'] = raw_data.Degree
estdf = pd.DataFrame(estdf.groupby(['Establishment','Degree']).size().reset_index(name='count'))
estdf = estdf.pivot(index='Establishment', columns='Degree', values='count')
estdf.reset_index(inplace=True)
trace1 = go.Bar(
x=estdf.Establishment,
y=estdf.Fatality,
name='Fatality'
)
trace2 = go.Bar(
x=estdf.Establishment,
y=estdf['Hospitalized injury'],
name='Hospitalized Injury'
)
trace3 = go.Bar(
x=estdf.Establishment,
y=estdf['Non Hospitalized injury'],
name='Non Hospitalized Injury'
)
data = [trace1, trace2, trace3]
layout = go.Layout(
barmode='group',
title = '<b>Establishment vs Degree</b>',
yaxis = dict(
title = 'Accident Count',
type = 'linear'
),
legend=dict(x=0, y=1.1, orientation='h')
)
fig = go.Figure(data=data, layout=layout)
off.iplot(fig, validate=False)
Data Setup
# Combine all predictor variables and prep data
dfmod = pd.concat([raw_data.iloc[:,[4,5]],raw_data.iloc[:,8:-1]], axis=1)
dfmod.reset_index(drop = True, inplace = True)
dfmod = pd.concat([dfmod, tfidf_df], axis=1)
dfmod.Nature = dfmod.Nature.astype('category').cat.as_ordered()
dfmod.Nature = dfmod.Nature.cat.codes+1
dfmod.Establishment = dfmod.Establishment.astype('category').cat.as_ordered()
dfmod.Establishment = dfmod.Establishment.cat.codes+1
y = raw_data.Degree
dfmod.tail()
Minor Data Processing
# convert the target Degree to a categorical variable
y = y.astype('category')
# center and scale columns in the dataframe
from sklearn import preprocessing
names = dfmod.columns
scaler = preprocessing.StandardScaler()
scaled_df = scaler.fit_transform(dfmod)
scaled_df = pd.DataFrame(scaled_df, columns=names)
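As a quick check of what StandardScaler does, a minimal sketch with toy values: each column ends up with mean 0 and (population) standard deviation 1.

```python
import numpy as np
from sklearn import preprocessing

# toy matrix (hypothetical values) standing in for dfmod's numeric columns
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaled = preprocessing.StandardScaler().fit_transform(X)
print(scaled.mean(axis=0))  # ~[0. 0.]
print(scaled.std(axis=0))   # [1. 1.]
```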
Model Selection
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
import warnings; warnings.simplefilter('ignore')
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb
models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0),
    AdaBoostClassifier(random_state=1),
    GradientBoostingClassifier(learning_rate=0.01, random_state=1),
    xgb.XGBClassifier(random_state=1, learning_rate=0.01)
]
CV = 5
entries = []
features = dfmod.iloc[:,14:]
labels = raw_data.Degree.astype('category').cat.as_ordered()
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
import seaborn as sns
import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(15, 5))
sns.boxplot(x='model_name', y='accuracy', data=cv_df)
sns.stripplot(x='model_name', y='accuracy', data=cv_df,
              size=8, jitter=True, edgecolor="gray", linewidth=2)
plt.title('Model Performance Comparison Boxplots')
plt.show()
plt.close()
print("Mean Accuracies")
print(cv_df.groupby('model_name').accuracy.mean())
print('\nMedian Accuracies')
print(cv_df.groupby('model_name').accuracy.median())
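To pick the winner programmatically rather than by eye, a small sketch using hypothetical fold accuracies in the same (model_name, fold_idx, accuracy) shape as cv_df:

```python
import pandas as pd

# hypothetical fold accuracies shaped like cv_df
entries = [('LinearSVC', 0, 0.70), ('LinearSVC', 1, 0.74),
           ('MultinomialNB', 0, 0.62), ('MultinomialNB', 1, 0.64)]
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])

# mean accuracy per model; idxmax returns the best-performing model's name
means = cv_df.groupby('model_name').accuracy.mean()
print(means.idxmax())  # LinearSVC
```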
Model Evaluation
Let's continue with our best models. For each, we look at the confusion matrix to show the discrepancies between predicted and actual labels.
Linear SVC
model = LinearSVC()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(tfidf_df, labels, raw_data.index, test_size=0.25, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('Accuracy:',model.score(X_test,y_test))
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred, target_names=labels.cat.categories))
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(8,6))
sns.heatmap(conf_mat, annot=True, fmt='d',
            xticklabels=labels.cat.categories, yticklabels=labels.cat.categories)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix for Linear SVC')
plt.show()
plt.close()
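The indices_test values returned by train_test_split make it easy to trace misclassified test rows back to their report summaries. A minimal sketch on toy stand-in data (the summaries, indices, and labels below are hypothetical):

```python
import numpy as np
import pandas as pd

# toy stand-ins for the report summaries, test indices, and labels
summaries = pd.Series(['fall from ladder', 'struck by vehicle', 'caught in machine'])
indices_test = pd.Index([0, 1, 2])
y_test = pd.Series(['Fatality', 'Hospitalized injury', 'Fatality'], index=indices_test)
y_pred = np.array(['Fatality', 'Fatality', 'Fatality'])

# line up summaries with actual vs predicted labels, keep only the disagreements
mis = pd.DataFrame({'Summary': summaries.loc[indices_test].values,
                    'Actual': y_test.values,
                    'Predicted': y_pred})
mis = mis[mis.Actual != mis.Predicted]
print(mis)
```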
XGBoost
import xgboost as xgb
from sklearn.model_selection import train_test_split
#from xgboost.sklearn import XGBClassifier
model = xgb.XGBClassifier(
    learning_rate=0.2,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    scale_pos_weight=1,
    colsample_bytree=0.8,
    seed=27,
    n_jobs=4,
    objective='multi:softmax',
    num_class=3)
xgb1 = model
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(tfidf_df, labels, raw_data.index, test_size=0.25, random_state=0)
model = model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred_prob = model.predict_proba(X_test)
print('Accuracy:', model.score(X_test,y_test))
Step 1: Fix learning rate and number of estimators for tuning tree-based parameters
The above is an XGBoost model fit with common baseline parameters and evaluated on a 25% hold-out test set. Next, we use cross-validation with early stopping to estimate the optimal number of estimators at this learning rate.
# Wrapper function to run XGBoost with cross-validation and determine the optimum number of estimators
def modelfit(alg, dXtrain, dytrain, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    from sklearn.metrics import classification_report
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dXtrain.values, label=dytrain.cat.codes)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
                          metrics='merror', early_stopping_rounds=early_stopping_rounds, stratified=True)
        alg.set_params(n_estimators=cvresult.shape[0])
    # Fit the algorithm on the data
    alg.fit(dXtrain, dytrain, eval_metric='merror')
    # Predict on the training set
    dtrain_predictions = alg.predict(dXtrain)
    dtrain_predprob = alg.predict_proba(dXtrain)[:, 1]
    # Print the model report
    print("\nModel Classification Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(dytrain.values, dtrain_predictions))
    report = classification_report(dytrain.values, dtrain_predictions)
    print(report)
    print(cvresult)
modelfit(xgb1, X_train, y_train)
Step 2: Tune max_depth and min_child_weight
from sklearn.model_selection import GridSearchCV #Perforing grid search
from xgboost.sklearn import XGBClassifier
import warnings; warnings.simplefilter('once')
param_test1 = {
    'max_depth': range(3, 10, 2),
    'min_child_weight': range(1, 6, 2)
}
gsearch1 = GridSearchCV(estimator=XGBClassifier(learning_rate=0.2, n_estimators=39, max_depth=5,
                                                min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
                                                objective='multi:softmax', num_class=3, n_jobs=4, scale_pos_weight=1, random_state=27),
                        param_grid=param_test1, scoring='f1_micro', n_jobs=4, iid=False, cv=5, return_train_score=True)
gsearch1.fit(X_train,y_train)
gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_
Step 3: Tune Gamma
param_test3 = {
    'gamma': [i / 10.0 for i in range(0, 5)]
}
gsearch1 = GridSearchCV(estimator=XGBClassifier(learning_rate=0.2, n_estimators=39, max_depth=3,
                                                min_child_weight=3, gamma=0, subsample=0.8, colsample_bytree=0.8,
                                                objective='multi:softmax', num_class=3, n_jobs=4, scale_pos_weight=1, random_state=27),
                        param_grid=param_test3, scoring='f1_micro', n_jobs=4, iid=False, cv=5, return_train_score=True)
gsearch1.fit(X_train,y_train)
gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_
from datetime import datetime
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
# Timer helper: call once to start, call again with the start time to print the elapsed time
def timer(start_time=None):
    if not start_time:
        start_time = datetime.now()
        return start_time
    elif start_time:
        thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
        tmin, tsec = divmod(temp_sec, 60)
        print('\n Time taken: %i hours %i minutes and %s seconds.' % (thour, tmin, round(tsec, 2)))
# A parameter grid for XGBoost
params = {
    'min_child_weight': [1, 2, 3, 4],
    'gamma': [0, 0.2, 0.4, 0.6, 0.8],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [1, 2, 3, 4, 5],
    'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100],
}
clf = XGBClassifier(learning_rate=0.2, n_estimators=300, objective='multi:softmax', num_class=3,
                    silent=True, n_jobs=4, random_state=27, scale_pos_weight=1)
folds = 5
param_comb = 10
skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 1001)
random_search = RandomizedSearchCV(clf, param_distributions=params, n_iter=param_comb, scoring='f1_micro', n_jobs=4, cv=skf.split(X_train,y_train), verbose=3, random_state=1001 )
# Here we go
start_time = timer(None) # timing starts from this point for "start_time" variable
random_search.fit(X_train, y_train)
timer(start_time) # timing ends here for "start_time" variable
print('\n All results:')
print(random_search.cv_results_)
print('\n Best estimator:')
print(random_search.best_estimator_)
print('\n Best f1_micro score for %d-fold search with %d parameter combinations:' % (folds, param_comb))
print(random_search.best_score_)
print('\n Best hyperparameters:')
print(random_search.best_params_)
results = pd.DataFrame(random_search.cv_results_)
results.to_csv('xgb-random-grid-search-results-01.csv', index=False)
xgb2 = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                     colsample_bytree=0.6, gamma=0, learning_rate=0.2, max_delta_step=0,
                     max_depth=4, min_child_weight=4, missing=None, n_estimators=300,
                     n_jobs=4, nthread=None, num_class=3, objective='multi:softmax',
                     random_state=27, reg_alpha=1e-05, reg_lambda=1, scale_pos_weight=1,
                     seed=None, silent=True, subsample=1.0)
model = xgb2.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred_prob = model.predict_proba(X_test)
from sklearn.metrics import classification_report
print('Accuracy:', metrics.accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))